In their writing textbook The Craft of Research, Booth et al. posit a key distinction between subjects and topics: “A subject is a broad area of knowledge (e.g. climate change), while a topic is a specific interest within that area (e.g. the effect of climate change on migratory birds). However,” they go on to clarify, “finding a topic is not simply a matter of narrowing your subject. A topic is an approach to a subject, one that asks a question whose answer solves a problem that your readers care about” (33).
In this lab, you will begin to generate questions that could form the basis for a data ethnography and analysis project, and to investigate what it might take to pursue answers to those questions. The first part of the lab is about topic selection and contextualization. The second part distinguishes among types of questions, and prompts you to consider what questions you have within your chosen topic or topics. Part three is informational: it gives an overview of what we mean when we say “open data,” and why so much of it is produced by government agencies. (For better or for worse: bear in mind D’Ignazio and Klein’s warning about zombie data, meaning copious production of poorly indexed, limited-context datasets.) This section also provides an introduction to some common data formats. Part four is also informational: it defines two kinds of metadata and explains how you can use them to understand what’s in a given dataset. In the final part of the lab, you’ll hop among potential data sources, to see how feasible it might be to find evidence related to your questions.
At various points, you will be expected to fill in information based on your own interests, findings, and previous responses.
While it can be freeing to have the universe open to your inquiry, there are also potential drawbacks to leaving topic selection entirely open: we could each end up swimming in our own separate oceans of research, making it very hard to lifeguard or even to bounce ideas back and forth with people who share your interests and burgeoning expertise. Even more so, if you’re uncertain where to begin, having no constraints can leave you aimless. So if you already have an idea for what you’d like to investigate, start there, and I’ll see if I can find some interest clusters as we move forward through the semester. But if you need a little more direction, consider whether one of the following broad topics leads you to a research question you’re curious about:
As you go about selecting a topic, you may consider what issues most matter to you, your family, and your community in the contemporary moment. The notebooks in this Lab Book are designed to support data research on just about any topic.
Fill response here.
Fill response here.
At this point, you are going to begin mapping out some of the project contexts. By this I mean that you are going to put your topic into temporal, geographic, and social-cultural perspective. Imagine that your topic is somehow depicted in the center of a plain white sheet of paper. What details would we need to fill into the background in order to bring this topic to life? We would need to add people. We would need to add environments. We would need to communicate the time and place in which the topic was being depicted. The eight questions below encourage you to draw out these contexts. Note that there is not necessarily a wrong way to answer these questions, and you absolutely do not need to answer them comprehensively: if we tried to respond to them “fully,” we’d probably go on writing forever! Instead, I’d like you to curate just a few things that come to mind when you consider your topic, because having some of these contexts written out will help you as you are searching for data.
For each question below, please respond in 1-2+ complete sentences.
Fill response here.
Fill response here.
Fill response here.
Fill response here.
Fill response here.
Fill response here.
Empirical research questions are questions that an analyst can assess evidence to address. Examples of empirical questions include:
Empirical research questions are specific to a particular time and place. Notice how I delimit the questions above to the US, and the second question to a specific month as well.
Now think about your topic. What questions could you ask about your topic that would contribute to an understanding of the topic’s prevalence, how the topic impacts diverse communities, how the topic has changed over time, or how communities are equipped to respond to the topic? Be sure that your question is one that you can assess evidence to address, and that it is specific to a certain time and place. For example, if your topic is mental health, you might ask:
Note how, with the right data, I could answer this question definitively.
Fill response here.
NB: Beware of setting up your question as a dichotomy! Did you use the word ‘or’ in your question above? I’ve seen many students do this in past assignments - asking questions like “Was this legislation beneficial to local communities, or was it harmful?” or “Did this technology fix inequities in the community, or did it sustain them?” In each of these questions, we have structured our research to test two conditions only. Yet, when it comes to studying complex contemporary issues, things are never black and white. It is highly likely that the situation we are examining in our research is much more complicated than these two conditions can capture. We should avoid structuring our research questions to test mutually exclusive categories. In placing false dichotomies in our research questions, we run the risk of oversimplifying complex causation, and we limit what the research can say.
If you did use a dichotomy, see if you can translate it now into a construction that allows for nuance. (For the first example above, you might break the dichotomy by subdividing who is affected: “For which groups in local communities was this legislation beneficial? For whom was it harmful?” Or you might instead question whether benefit and harm are mutually exclusive, allowing people to feel both: “What were the benefits and harms of this legislation to local communities?” Similar logic applies to the second example: not all inequities have to change in the same direction, or for all parties.)
Social-Theoretical questions are questions about how certain social, political, or environmental phenomena operate at a broad scale. They are much broader than empirical research questions. We often cannot answer these questions with one research project. However, we often aim to increase understanding of these questions through a research project, adding evidence for possible answers rather than definitively settling them. Examples of social-theoretical questions include:
Note how for each of these questions I would need to examine lots of different data and carry out a number of different projects to answer them.
Now think about your topic. What broader questions about how social, political, or environmental phenomena operate might you wish to address through research?
Fill in the table below with at least 5 datasets that would help you address your empirical research questions. Be specific, filling out their geographic scope and the timespan they would cover. In the first column, characterize the topic of the dataset. In the second and third columns, describe the geographic and temporal scope. Finally, in the last column, mark whether you believe such a dataset is accessible.
| Dataset | Geography | Timespan | Do you think this data is accessible? |
|---|---|---|---|
| Number of confirmed cases of Covid-19 | All countries across the globe | January 2020 to present | Yes |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill |
In May 2009, Data.gov, a web portal for accessing US government datasets, was launched by then federal Chief Information Officer Vivek Kundra. Following this, in December 2009, the Obama administration issued the Open Government Directive, requiring that all federal agencies post at least 3 high-value datasets on Data.gov within 45 days. A few years later, in May 2013, President Obama signed an Executive Order on “Making Open and Machine Readable the New Default for Government Information.”
The Order required that the US Office of Management and Budget, in collaboration with the CIO and CTO, put out and oversee an Open Data Policy. This policy required the following:
See the relevant White House memo for more details.
Note that these are only requirements for data produced through the federal government. Cities, states, and counties have their own open data programs, policies, and laws, which are sometimes more and sometimes less stringent than the federal policy. However, most open data policies, in some way, deal with the three issues listed above - machine-readability, licensing, and metadata.
Let’s talk about what each of these issues entails:
Poirier tells a story about the importance of machine-readability (and how it isn’t always achieved):

> Last quarter, in my Hack for California research cluster, one project was examining gentrification over the past ten years around and near the UC Davis Medical Center in Sacramento. One of the indicators we were examining in relation to gentrification was the number of construction permits the city had awarded in that area for demolitions, new buildings, and remodeling. The City of Sacramento has construction permit data from 2012 to present stored in Excel files on their website - one file of permits per month. One of the students in the research cluster (and in fact one of your stellar classmates) was able to write a script to download each of these files and bind them into one large file. However, we were examining gentrification over a much longer period of time and needed construction permits dating back to 1990. We knew that the City had this data because they had produced a public map, where you could search for an address in Sacramento and retrieve every construction permit it had been awarded since the early 1980s. We needed a data file that listed this for every address in the city. With this in mind, I submitted a public records request asking for the following: “I’m looking for a data file of all construction permits issued across Sacramento from 1990 to present in a downloadable, machine-readable format.”
>
> Two weeks later they sent me back a 9,543-page PDF document listing every construction permit awarded in the city since 1990. I took a deep breath. Had they sent this in a CSV file, we could have gotten to work immediately. It would take hours to get this hefty document into a format we could work with. I really hate PDFs.
Machine-readable data are data that can be readily processed by a computer. Typically machine-readable data are structured in ways that many different computer applications can recognize. As Poirier’s story above indicates, there are different degrees of machine-readability of digital data:
Name, Age, Birth Month, Time on Phone
Sally, 23, 3, 42
Julie, 40, 2, 98
Mark, 14, 8, 120
However, if we were to open the same file in Excel, each value would be separated into its own cell. A CSV file is software independent. Because CSV is a standardized, plain-text way of storing tabular data, just about any computer application that displays data is prepared to read a CSV file and format it for display.
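To see this software independence in practice, here is a minimal R sketch that reads the four lines of CSV text above into a data frame. It uses only base R, and the object names `csv_text` and `phone_use` are my own, chosen for illustration.

```r
# The same four lines of CSV text shown above, stored as a single string.
csv_text <- "Name, Age, Birth Month, Time on Phone
Sally, 23, 3, 42
Julie, 40, 2, 98
Mark, 14, 8, 120"

# read.csv() splits each line at the commas. strip.white = TRUE drops the
# space that follows each comma in this particular example.
phone_use <- read.csv(text = csv_text, strip.white = TRUE,
                      stringsAsFactors = FALSE)

print(phone_use)

# str() shows the structure R inferred: one row per person and one column
# per variable, with the three numeric columns read as numbers.
str(phone_use)
```

Because two of the headers contain spaces, `read.csv()` rewrites them by default as `Birth.Month` and `Time.on.Phone` so that they are valid R column names.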
While we won’t work with such formats in this course, data can be made even more machine-readable than a CSV file. Formats like XML and RDF allow us to structure data with much more specificity. They are often considered the gold standards of machine-readability. Sir Tim Berners-Lee, the inventor of the World Wide Web, often uses this diagram to outline the degrees of machine-readability of open data:
5-star open data
In the above chart, the acronyms are short for the following:
Just because something is available on the Web does not necessarily mean that we are free to download and use it as we please. Historically, different government agencies would allow access to certain datasets for a fee that would help to cover the costs of running public data programs. (Oftentimes, we hear this referred to as data being behind a “paywall.”) With the US Open Government Directive, all data produced by any federal US agency would default to the public domain. When data are in the public domain, they are owned by the public. The data are not subject to any copyright or intellectual property law and can be accessed, modified, reproduced, and distributed without any restrictions.
Data acquired by any federal US agency needed to be given an open data license that met the following criteria:
There are a few global licenses that government agencies can apply to data that meet these criteria. One such license is the Creative Commons CC0 1.0 Universal Public Domain Dedication (CC0 1.0).
Creative Commons is a non-profit organization that aims to increase the availability of creative works that the public can remix and share. Creative Commons has created a number of free licenses that the public can apply to their own creative works in order to designate the extent to which others can modify and redistribute them. These licenses indicate whether individuals other than the content creator may share the work, remix the work, and/or make money off of the work, along with whether such individuals have to attribute the content creator when sharing it. The following image outlines a number of Creative Commons licenses from most open to least open. You’ll notice that CC0 1.0 - the license compatible with each of the criteria listed above – is at the top of the image.
See this dataset of Towed Cars for the Past Thirty Days in Hartford, CT, which has been licensed with the CC0 1.0.
Hartford License
Another open license that meets the criteria listed above is the Open Data Commons Public Domain Dedication and License (PDDL); see the license text for specifics about the PDDL. See this dataset of City Hall Electricity Usage in Boston, which has been licensed with the PDDL:
Different cities and states throughout the US will have different laws about the degree to which data should be openly licensed. In most open government data portals, you will be able to discern how data is licensed from its administrative metadata, to which we will turn next.
Metadata is data about data. There are two kinds of metadata:
Administrative metadata will answer questions such as:
Check out the administrative metadata available for this open dataset detailing Pittsburgh 311 Calls. Here is a screenshot from the linked page, toward the bottom:
Under the About this Dataset section on the page, you’ll see a number of fields describing the data - how frequently the data is updated, who published the data, who to contact regarding the data, who has rights to use and distribute the data, when the data was created, and when the data was last updated. All of this metadata provides us with information about how this data is managed; in other words, it provides administrative metadata. Why is administrative metadata so important? Here are just a few reasons:
As an example of why this is important, consider this dataset documenting traffic accidents in Denver. There is a long preamble to the data, indicating important information about how the dataset gets updated. Every time an accident occurs it is entered into the dataset. However, when first reported, the Denver Police Department likely does not have all of the information about the accident. That information becomes available through investigations, and as it becomes available, the entries associated with that accident in the dataset are updated. The following disclaimer on the data portal provides important administrative metadata:
“Incidents that occurred at least 30 days ago tend to be the most accurate, although records are returned for incidents that happened yesterday. For motor vehicle crashes that are still under investigation and involve a serious bodily or fatal injury, some attributes will appear as ‘UNDER INVESTIGATION.’ This is to help ensure that any court proceedings related to these incidents are not inadvertently hindered. Once the investigation is closed, all of the incident’s attributes will be visible. This dynamic nature of motor vehicle crash data means that content provided here today will probably differ from content provided a week from now. Likewise, content provided on this site will probably differ somewhat from crime statistics published elsewhere by the City and County of Denver, even though they draw from the same database.”
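One way to act on a disclaimer like this is to set aside the most recent records before analyzing the data. The sketch below is only an illustration: the `crashes` data frame, its column names, and its values are all invented, and the real Denver dataset is organized differently.

```r
# Hypothetical crash records. The column names and values are made up
# for illustration and do not come from the Denver dataset.
crashes <- data.frame(
  incident_id   = 1:4,
  reported_date = as.Date(c("2024-01-05", "2024-02-01",
                            "2024-03-01", "2024-03-09")),
  severity      = c("MINOR", "SERIOUS",
                    "UNDER INVESTIGATION", "UNDER INVESTIGATION"),
  stringsAsFactors = FALSE
)

# Pretend the data was downloaded on this date. With real data you would
# record the actual download date, for example with Sys.Date().
download_date <- as.Date("2024-03-10")

# Age of each record in days at the time of download.
age_in_days <- as.numeric(download_date - crashes$reported_date)

# Keep only incidents at least 30 days old, which the disclaimer describes
# as tending to be the most accurate.
settled <- crashes[age_in_days >= 30, ]

# Count how many records still carry the placeholder value.
n_pending <- sum(crashes$severity == "UNDER INVESTIGATION")

print(settled)
print(n_pending)
```

Whether to drop, flag, or keep the newer records is a judgment call that depends on your question; the point is that the administrative metadata tells you the choice exists.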
While administrative metadata characterizes how the data is managed, descriptive metadata tells us about the content of a dataset. With descriptive metadata, we should be able to answer questions such as:
Check out some of the descriptive metadata for the same Pittsburgh 311 Requests dataset above. (To find the information in the screenshot below, I clicked on “311 Data,” under “Resources”, and scrolled down.)
As part of this basic descriptive metadata, we can see that there are 543,267 rows in the dataset and (counting offscreen below the screenshot) 17 columns. We can see that each row (or observation) in the dataset represents one 311 Request, and we can see each of the columns (or variables) in the dataset that describe a request.
The Pittsburgh 311 dataset is published in The Western Pennsylvania Regional Data Center, an open data portal that includes data from the City of Pittsburgh and surrounding areas. Many cities, counties, states, and countries now have their own versions of an open data portal where various government agencies under their jurisdictions publish datasets. You’ll notice that many of these portals look very similar, in one of two models. This is because, over the past decade, two data management platforms have become go-to resources for opening government data. The one shown above, CKAN, is an open-source system also used by the federal government for data.gov. The other main platform you’re likely to see is Socrata, a privately developed platform designed specifically for open government initiatives. A considerable number of open data platforms run on Socrata and will thus look and feel similar to this 311 database for Baltimore, MD:
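When a portal does not display these basics, you can often recover them yourself once the data is loaded. Here is a small base R sketch of that check. The `requests` data frame is a made-up stand-in with invented column names, not the actual Pittsburgh 311 data.

```r
# A tiny, hypothetical stand-in for a 311 requests table. The columns and
# values are invented for illustration.
requests <- data.frame(
  request_id   = c(1, 2, 3),
  created_on   = as.Date(c("2024-01-02", "2024-01-02", "2024-01-03")),
  request_type = c("Potholes", "Weeds/Debris", "Potholes"),
  stringsAsFactors = FALSE
)

# How many observations (rows) and variables (columns) are there?
n_rows <- nrow(requests)
n_cols <- ncol(requests)

# What is each column called, and what type of data does it hold?
column_types <- vapply(requests, function(col) class(col)[1], character(1))

print(c(rows = n_rows, columns = n_cols))
print(column_types)
```

These counts tell you the shape of the data, but not what a row represents or what each column means; that still has to come from the publisher's documentation.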
Descriptive metadata is often documented in a data dictionary. Data dictionaries are tools for looking up what various variables and codes in a dataset refer to. As we’ve discussed, and as our readings have emphasized, any count that we produce or any measurement that we produce will always be dependent on how we define what we are counting or measuring. If we are going to count the number of cars on the road, we have to ask: What counts as a car? What will I include in the count and what will I exclude? If we were going to measure the height of a chair, we have to ask: According to what units am I going to measure the height? The numbers that we produce will be shaped by the choices we make regarding data collection. Without documenting those choices, the values won’t make much sense to others. Again, we must first define things in order to count or measure them. This is why it is so important that when we share our data, others can look up the definitions that we are using to produce our values.
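To make the idea of looking up definitions concrete, here is a minimal sketch of how a code book can be joined to a dataset in base R. Everything in it is hypothetical: the `calls` table, the `priority` codes, and their labels are invented for illustration and do not come from any real portal.

```r
# Hypothetical call records that store priority as a numeric code.
calls <- data.frame(
  call_id  = c(101, 102, 103, 104),
  priority = c(1, 3, 2, 9)
)

# Hypothetical code book: one row per code, with its documented meaning.
code_book <- data.frame(
  priority       = c(1, 2, 3),
  priority_label = c("Emergency", "Urgent", "Non-emergency"),
  stringsAsFactors = FALSE
)

# Attach the labels to the records. all.x = TRUE keeps every call, even
# when its code has no entry in the code book.
decoded <- merge(calls, code_book, by = "priority", all.x = TRUE)
decoded <- decoded[order(decoded$call_id), ]

# Codes that appear in the data but are not defined in the code book.
undocumented <- decoded[is.na(decoded$priority_label), ]

print(decoded)
print(undocumented)
```

Here, code `9` appears in the data with no documented meaning. That is exactly the kind of gap worth raising with the dataset's contact person rather than guessing at.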
With a data dictionary, a data analyst can look up how values were defined much like they would look up the definition of a word in the Oxford English Dictionary. For instance, in the also publishes a dataset documenting 911 Calls for Service, and in addition to the information available above, they provide an attachment to a document identifying what number codes and abbreviations in the dataset refer to (sometimes known as a code book):
attachments
codes
New Orleans also publishes a version of the 911 calls for service dataset. They have created a separate data dictionary document for recording descriptive metadata about the dataset.
attachments
Data Dictionary
Check out this really nice piece on FiveThirtyEight, detailing the extent to which different cities and states have made crime data available for public analysis.
Data dictionaries are often very important for crime data as there are a number of caveats to what can be publicly reported with regards to crime. Consider that New York City publishes a dataset documenting each arrest, the date it occurred, where it occurred, the crime suspected, and demographic details about the person arrested from the start of the year to the present.
library(tidyverse)
library(jsonlite)
nyc_arrests <- fromJSON("https://data.cityofnewyork.us/resource/8h9b-rp9u.json")
nyc_arrests %>% head()
However, for certain crimes, such as rape, sharing the location of where the crime occurred can put individuals at considerable risk. When we filter this data to only represent rape arrests that have occurred in Manhattan, and then map it, you will notice that there are very few locations represented on the map: 19 at the time of writing this lab.
library(leaflet)
nyc_arrests %>%
  filter(ofns_desc == "RAPE" & arrest_boro == "M") %>%
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addMarkers(~as.numeric(longitude), ~as.numeric(latitude))
Does this mean that only 19 arrests have been made for rapes in Manhattan since the start of the year? It does not. From examining the data documentation, we learn that, unlike other crimes in NYC, the locations of all rape arrests are geocoded in the dataset to the address of the station house of the police precinct where the rape occurred. This is to protect the privacy and anonymity of the victim.
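A quick way to catch this kind of overlap is to compare the number of rows with the number of distinct coordinate pairs before trusting what a map appears to show. The sketch below uses a small invented `arrests` table. The `latitude` and `longitude` column names match the code above, but the coordinates and the `arrest_id` column are hypothetical.

```r
# Hypothetical arrest records. Several share identical coordinates, as
# happens when locations are geocoded to a precinct station house.
arrests <- data.frame(
  arrest_id = 1:6,
  latitude  = c(40.7580, 40.7580, 40.7580, 40.7128, 40.7128, 40.8000),
  longitude = c(-73.9855, -73.9855, -73.9855, -74.0060, -74.0060, -73.9500)
)

# Number of arrests (rows) versus number of distinct map locations.
n_arrests   <- nrow(arrests)
n_locations <- nrow(unique(arrests[, c("latitude", "longitude")]))

# Count how many arrests sit on top of each location.
per_location <- aggregate(arrest_id ~ latitude + longitude,
                          data = arrests, FUN = length)
names(per_location)[3] <- "n_arrests"

print(c(arrests = n_arrests, locations = n_locations))
print(per_location)
```

When the two numbers differ, each marker on the map may be standing in for several records, so counting markers will undercount arrests.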
Some data dictionaries are very robust, detailing not only what each column refers to, but also the expected type of data in that column, every value that can appear in that column, where data might be missing in that column and why, and information about how the values in that column were generated. Other data dictionaries include much less information, forcing a data analyst to make assumptions about what values mean. Sometimes, a data dictionary is easily accessible with a dataset. However, sometimes, it can be much harder to find descriptive metadata – at times because it is buried within complicated user interfaces, and at other times, because it has not been created at all.
In an ideal world, we would have rich metadata for every dataset published on an open government data portal; unfortunately, this is rarely the case. The robustness of the data documentation, along with its ease of accessibility, can indicate the extent to which data publishers have prioritized responsible stewardship of the data. It can also indicate the human, financial, and technical resources various governments have available for data management and stewardship. Some agencies that publish dozens of datasets on an open data portal employ just one person at their agency responsible for managing the publication and stewardship of all of those datasets, in addition to that person’s other duties.
Notably, without descriptive metadata, it is much more likely that we will make poor assumptions about what certain terms mean in the data. If you ever find yourself in a situation where you don’t know what a term in an open government dataset means, this is where administrative metadata can be important. Before drawing conclusions from the data, you should contact the designated contact person with a detailed message explaining points of confusion in the data. (You may also want to encourage them to document their response in a data dictionary!)
Now you will begin to search for open government or academic datasets related to your chosen topic. Below I have some recommendations for open data portals where students have found materials related to their topics in the past. However, I also encourage you to search for data in city and state open data portals.
Based on your research, fill the datasets most suitable for examining your topic into the chart below. Be sure to respond to each of the questions in the chart. I’ve included a data dictionary below so that you may reference what each column means.
In the first row, I’ve included a dataset that we will use as an example in later labs.
| Dataset? | Found? | Formats? | DD? | Timespan? | Geography? | Row? | Update_freq? | Source? |
|---|---|---|---|---|---|---|---|---|
| Coronavirus (Covid-19) Data in the United States | https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv | .csv | Yes | Jan 21, 2020 to Present | United States Counties | Cases per day by State/County | Daily | The New York Times |
| Fill | Fill | Fill | Fill | Fill | Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill | Fill | Fill | Fill | Fill | Fill |
| Fill | Fill | Fill | Fill | Fill | Fill | Fill | Fill | Fill |